Abstract



For complex educational assessments, there is an increasing use of item families, which are groups
of related items. However, calibration or scoring for such an assessment requires fitting models 
that take into account the dependence structure inherent among the items that belong to the same 
item family. Glas and van der Linden (2001) suggest a Bayesian hierarchical model to analyze 
data involving item families with multiple choice items. This paper extends the model to take 
into account item families with constructed response items, and designs a Markov chain Monte 
Carlo (MCMC) algorithm for the Bayesian estimation of the model parameters. The hierarchical 
model, which accounts for the dependence structure inherent among the items, implicitly defines 
the family response function (FRF) for the score categories. This paper suggests a way to 
combine the FRFs over the score categories to obtain a family score function (FSF), which is a 
quick graphical summary of the expected score of an individual with a certain ability on an item
randomly generated from an item family. This paper also suggests a method for the Bayesian 
estimation of the FRF and FSF. This work is a significant step towards building a tool to analyze 
data involving item families and may be very useful practically, for example, in automatic item 
generation systems that create tests involving item families. 



Key words: Hierarchical model; Markov chain Monte Carlo; Automatic item generation; Family
response function; Family score function; Item score function 
1. Introduction



The operation of large-scale high-stakes testing programs demands a large pool of 
high-quality items from which items can be sampled. Large pools are especially important 
for assessment programs that offer flexible assessment times, such as the Graduate Record
Examinations (GRE), where concerns over item exposure and potential disclosure are the greatest.
While efforts to populate item pools are laborious for pools consisting entirely of multiple 
choice items, the same efforts for complex constructed response tasks are even more challenging. 
In response to the effort, expense, and occasionally inconsistent item quality associated with
traditional item production, there is an increasing interest in using item models to guide production
of items, automatically or manually, with similar conceptual and statistical properties. Irvine and
Kyllonen (2002) survey some current areas of investigation for item modeling and generation.
Items produced from a single item model, whether by automatic item generation (AIG) systems 
or by rigorous manual procedures, are related to one another through the common generating 
model, and therefore constitute a family of related items. 

Naturally, it is necessary and beneficial to use calibration models that account for the
dependence structure among the items from the same item family. The works by Janssen, 
Tuerlinckx, Meulders, and De Boeck (2000) and Wright (2002) are initial attempts at building 
such models. Glas and van der Linden (2001) suggest one such model for multiple choice items 
that is more general. The model assumes the item parameters of a three-parameter logistic model 
(3PL; Lord, 1980) to be normally distributed — the mean vector and the variance matrix of the 
normal distribution depend on the item model from which the item is generated. 

Glas and van der Linden’s model has some similarity to the testlet model of Bradlow, 
Wainer, and Wang (1999). Both models describe an extra level of dependence in the observed 
assessment data. However, the testlet model describes the extra “local” dependence between a
single examinee’s item responses within a testlet, whereas the item family model explains the 
dependence between all examinees’ responses to the same single member from an item family. 

Glas and van der Linden’s model has the limitation that it cannot take into account 
families with constructed response items. This paper generalizes Glas and van der Linden’s model 
to take into account item families with constructed response items. Further, this work designs a 
method to estimate the joint posterior distribution of the model parameters using the Markov 
chain Monte Carlo (MCMC; Gilks, Richardson, & Spiegelhalter, 1996) algorithm.

The hierarchical model implies a family response function (FRF) for each category for an 
item family. The FRF for category k for an item family gives the probability that an individual 
with a given ability will score k on an item randomly generated from the item family. The idea 
is similar to that behind the family expected response function defined by Sinharay, Johnson, and 
Williamson (2003) who deal with families with dichotomous items. This paper suggests a way to 
combine the FRFs over the score categories to obtain a family score function (FSF), which is a 
quick graphical summary of the expected score of an individual with a given ability on an item
randomly generated from an item family. This work also suggests a way to compute estimates of 
the FRF and FSF and an approximate prediction interval around them using the Monte Carlo 
method and the output of the MCMC algorithm. The paper examines the performance of the 
hierarchical model using both simulated data and real data examples.

The next section provides a broad overview of the existing techniques for the analysis of
data involving item families. Section 3 describes in detail one such model, which generalizes
Glas and van der Linden’s hierarchical model to account for constructed response item family
data as well. Section 4 discusses estimation of the model parameters and the family expected
response functions using the Markov chain Monte Carlo (MCMC) algorithm. Section 5 reports 
the results from a simulation study. Section 6 discusses the application of the model to a data set 
from the National Assessment of Educational Progress (NAEP). Finally, the paper concludes with 
a summary of the findings and thoughts on possible future directions. 

2. Models for item families 

There are three approaches for modeling data involving item families. The models all 
build from standard item response theory (IRT) models for dichotomous and polytomous data. 
This paper uses the three-parameter logistic model (3PL; Birnbaum 1968) to describe the response 
behavior of examinees to multiple choice items, and the generalized partial credit model (GPCM; 
Muraki 1992) to describe the response behavior of examinees to constructed response items. 

The 3PL model assumes that the probability that an individual with ability $\theta_i$ correctly
responds to item j is defined by the following equation:

$$
P_{j1}(\theta_i) = P(X_{ij} = 1 \mid \theta_i, a_j, \beta_j, c_j) = c_j + \frac{1 - c_j}{1 + \exp\{a_j(\beta_j - \theta_i)\}}. \tag{1}
$$

The parameter $c_j$ is called the asymptote, $a_j$ is the item’s discrimination, and $\beta_j$ is the difficulty
of the item. $P_{j0}(\theta_i) = 1 - P_{j1}(\theta_i)$ is the probability that an examinee with ability $\theta_i$ incorrectly
responds to item j.

The GPCM assumes that the adjacent category logits are linear in the examinee’s
proficiency $\theta_i$. Mathematically the GPCM probabilities for an item with K score categories
$(0, 1, \ldots, K - 1)$ are defined by the following:

$$
P_{jk}(\theta_i) = P(X_{ij} = k \mid \theta_i, a_j, \beta_j, \delta_{j1}, \ldots, \delta_{j,K-1})
= \frac{\exp\{k a_j(\theta_i - \beta_j) - a_j \sum_{\ell=0}^{k} \delta_{j\ell}\}}
       {\sum_{h=0}^{K-1} \exp\{h a_j(\theta_i - \beta_j) - a_j \sum_{\ell=0}^{h} \delta_{j\ell}\}} \tag{2}
$$

for $k = 0, 1, \ldots, K - 1$, where $a_j$ is the item discrimination, $\beta_j$ is the overall difficulty for the
item, and $\delta_{jk}$ is the k-th item-category step parameter. To ensure identifiability, set $\delta_{j0} = 0$
and $\sum_{\ell=0}^{K-1} \delta_{j\ell} = 0$, which results in the need of estimating only $(K - 2)$ parameters out of
$\delta_{j1}, \ldots, \delta_{j,K-1}$ to fit the GPCM. Note that the two-parameter logistic model (2PL; Birnbaum,
1968) is a special case of this model with K = 2 and no $\delta$-parameter.
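
The two response models can be written down directly from (1) and (2). The following is a minimal Python sketch, not the authors' program (which the paper says is written in C++); the function names `p_3pl` and `p_gpcm` are illustrative choices, and the caller is assumed to supply step parameters that already satisfy the identifiability constraints.

```python
import math

def p_3pl(theta, a, beta, c):
    """Probability of a correct response under the 3PL model, equation (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(a * (beta - theta)))

def p_gpcm(theta, a, beta, delta):
    """Category probabilities under the GPCM, equation (2).

    delta holds (delta_0, ..., delta_{K-1}); the constraints delta_0 = 0 and
    sum(delta) = 0 are assumed to hold and are not checked here.
    """
    logits = []
    cumulative = 0.0
    for k, step in enumerate(delta):
        cumulative += step
        logits.append(k * a * (theta - beta) - a * cumulative)
    largest = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(z - largest) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# With K = 2 and no step parameters the GPCM reduces to the 2PL,
# which is the 3PL with a zero asymptote.
two_category = p_gpcm(0.5, 1.2, -0.3, [0.0, 0.0])
two_pl = p_3pl(0.5, 1.2, -0.3, 0.0)
```

Here `two_category[1]` and `two_pl` agree up to floating-point error, which illustrates the special case noted above.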

Unrelated Siblings Model 

The gold standard approach for modeling item response functions is to assume that each 
item is independent of all other items, regardless of whether they are siblings or not. We call 
this approach the Unrelated Siblings Model (USM). The USM assumes that examinee i’s
score $X_{ij}$ on item j follows a multinomial or binomial distribution with probabilities defined by the
generalized partial credit model in (2) for constructed response items and probabilities defined by 
the three-parameter logistic model in (1) for multiple choice items. The USM makes no other 
assumption about the item parameters. 

Although standard IRT software can fit the USM, the model has the disadvantage that 
each item has to be individually calibrated. In addition, this approach ignores the relationship 
between siblings in an item family and hence will provide standard errors of item parameters that 
are too large, and will require larger sample sizes for acceptable calibration precision. 

Identical Siblings Model 

Hombo and Dresher (2001) study the results of a model that assumes the same item 
response function for all items in the same item family. We call this approach the Identical Siblings 
Model (ISM). This model assumes the item parameters of the 3PL and GPCM remain constant for
all items in the family (e.g., $\beta_j = \beta_{\mathcal{I}(j)}$). While this model can also be fit by standard software like
PARSCALE (Muraki and Bock, 1997) or BILOG (Mislevy and Bock, 1982), it has the limitation
that it ignores any variation between siblings and hence, in the face of such variation, provides
incorrect estimates of the item parameters and examinee scores.

Related Siblings Model 

One way to overcome the limitations of the above-mentioned two methods is to apply 
the Related Siblings Model (RSM), a hierarchical model that assumes a separate item response 
function for each item, but relates siblings within a family using a hierarchical component (Glas 
and van der Linden 2001). The method uses a mixing distribution to describe the relationship 
between items within the same item family, in much the same way that the mixing distribution 
on the student parameter $\theta$ in an IRT model is used to describe the dependence between item
responses from the same examinee. 

One important point to note here is that the ISM and USM are limiting cases of the RSM.
If the mixing distribution approaches a point mass (that is, the variances of the mixing distribution
go to 0), then the RSM approaches the ISM in the limit. On the other hand, if the mixing
distribution approaches the Lebesgue measure (that is, the variances of the mixing distribution go
to $\infty$), then the RSM approaches the USM in the limit (Sinharay, Johnson, & Williamson,
2003).

While the advantage of the RSM is that it properly accounts for the variability among 
the items for the same item model, it has the disadvantage that there is no standard software for 
fitting this model. We use our own C++ program to fit the RSM in this work. 

Janssen, Tuerlinckx, Meulders, and De Boeck (2000) and Wright (2002) provide examples 
of such models. Glas and van der Linden (2001) suggest one such model for multiple choice items 
that is more general; the model starts from a three-parameter logistic model (3PL; Lord, 1980) 
and uses a normal mixing distribution to relate the item parameters belonging to the same family. 
The mean vector and the variance matrix of the normal mixing distribution depend on the item 
family from which the item is generated. 

All the above-mentioned models have the limitation that they cannot take into account 
families with constructed response items. The next section discusses an example of an RSM 
that encompasses constructed response items as well. The model is an extension of the model
by Glas and van der Linden (2001), is a Bayesian hierarchical model, and uses a normal mixing 
distribution to relate siblings. 



3. An RSM for Constructed Response Item Families 



Suppose there are J items denoted by $j = 1, 2, \ldots, J$ in a test, and that the j-th item is
scored on a scale from 0 to $(K_j - 1)$. Suppose that the test is given to N examinees. Let $\mathcal{I}(j)$ be
the item family of which item j is a member. Items j and k are siblings if they are members of
the same item family, i.e., if $\mathcal{I}(j) = \mathcal{I}(k)$.

We model multiple choice (MC) items using the 3PL model defined in (1) and
constructed response (CR) items using the GPCM defined in (2). To be able to use a normal
mixing distribution on the item parameters, we apply the transformations $\alpha_j = \log(a_j)$ and
$\gamma_j = \log\{c_j / (1 - c_j)\}$. Assuming normality of $\alpha_j$ and $\gamma_j$, both of which range from $-\infty$ to $\infty$
(whereas $a_j$ ranges from 0 to $\infty$ and $c_j$ ranges from 0 to 1), is quite reasonable. Recall that fitting a 3PL
model requires estimating $a_j$, $\beta_j$, and $c_j$ for each item, while fitting a GPCM requires estimating
$a_j$, $\beta_j$, and any $(K_j - 2)$ out of $\delta_{j1}, \delta_{j2}, \ldots, \delta_{j,K_j-1}$.



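Let

$$
\eta_j =
\begin{cases}
(\alpha_j, \beta_j, \gamma_j)' & \text{if item } j \text{ is an MC item} \\
(\alpha_j, \beta_j, \delta_{j2}, \ldots, \delta_{j,K_j-1})' & \text{if item } j \text{ is a CR item}
\end{cases}
\tag{3}
$$

be the item parameter vector for item j. Then the hierarchical model defining the likelihood of
the related siblings model is

$$
\begin{aligned}
X_{ij} &\sim \text{Multinomial}\bigl(1;\, P_{j0}(\theta_i), \ldots, P_{j,K_j-1}(\theta_i)\bigr), \\
\theta_i &\sim \mathcal{N}(\mu, \sigma^2), \\
\eta_j &\sim \mathcal{N}(\lambda_{\mathcal{I}(j)}, T_{\mathcal{I}(j)}),
\end{aligned}
\tag{4}
$$

where the probabilities $P_{j0}, \ldots, P_{j,K_j-1}$ are defined in (1) or (2) depending on whether the item is
an MC or CR item,

$$
\lambda_{\mathcal{I}(j)} =
\begin{cases}
(\lambda_{a\mathcal{I}(j)}, \lambda_{b\mathcal{I}(j)}, \lambda_{g\mathcal{I}(j)})' & \text{if item } j \text{ is an MC item} \\
(\lambda_{a\mathcal{I}(j)}, \lambda_{b\mathcal{I}(j)}, \lambda_{d_1\mathcal{I}(j)}, \lambda_{d_2\mathcal{I}(j)}, \ldots, \lambda_{d_{K_j-2}\mathcal{I}(j)})' & \text{if item } j \text{ is a CR item}
\end{cases}
\tag{5}
$$

is the mean item parameter vector for family $\mathcal{I}(j)$, and $T_{\mathcal{I}(j)}$ is the within-family item parameter
covariance matrix for family $\mathcal{I}(j)$. We will call the $\lambda_a$’s the family discrimination parameters
and the $\lambda_b$’s the family difficulty parameters. To fix the origin and scale (ensure identifiability),
let $\mu = 0$ and $\sigma^2 = 1$ (an alternative method would be to force sum-to-zero constraints on the first
and second components of the item family mean parameter vectors $\lambda_{\mathcal{I}}$). Note that expectations
are not invariant under transformations ($E[a_j] = E[e^{\alpha_j}] \neq e^{E[\alpha_j]}$). However, because the transformation is
monotone increasing, the quantiles (including the median) are invariant under the transformation.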
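
As a worked illustration of this point, consider the discrimination parameter. The symbol $\tau_a^2$ below is introduced here only for illustration and denotes the diagonal element of $T_{\mathcal{I}(j)}$ corresponding to $\alpha_j$, so that marginally $\alpha_j \sim \mathcal{N}(\lambda_a, \tau_a^2)$ and $a_j = e^{\alpha_j}$ is lognormal:

```latex
\begin{align*}
E[a_j] &= \exp\!\left(\lambda_a + \tfrac{1}{2}\tau_a^2\right)
        \;>\; \exp(\lambda_a) = \exp\!\left(E[\alpha_j]\right), \\
\operatorname{median}(a_j) &= \exp(\lambda_a) = \exp\!\left(\operatorname{median}(\alpha_j)\right).
\end{align*}
```

So $\exp(\lambda_a)$ can be read as the median discrimination of the family, but not as its mean, unless the within-family variance is zero.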

A fully Bayesian formulation of the model requires the specification of prior distributions
for the model parameters $\lambda_{\mathcal{I}(j)}$ and $T_{\mathcal{I}(j)}$. We employ conjugate prior distributions for
these parameters as in Glas and van der Linden (2001). Assume independent multivariate normal
prior distributions for the family mean item parameter vectors $\lambda_{\mathcal{I}(j)}$,

$$
\lambda_{\mathcal{I}(j)} \sim \mathcal{N}(\mu_\lambda, V_\lambda), \tag{6}
$$

and independent inverse-Wishart prior distributions on the $T_{\mathcal{I}(j)}$’s,

$$
T_{\mathcal{I}(j)}^{-1} \sim \text{Wishart}(W_1, W_2). \tag{7}
$$

The notation $M \sim \text{Wishart}(W_1, W_2)$ implies that the density function of the $p \times p$ matrix M is
proportional to

$$
|M|^{(W_1 - p - 1)/2} \exp\left\{ -\tfrac{1}{2} \operatorname{tr}\left( W_2^{-1} M \right) \right\}.
$$

The prior in (7) implies that the prior mean of $T_{\mathcal{I}(j)}^{-1}$ is $W_1 W_2$, and that a priori there is
information that is equivalent to $W_1$ observations of the item parameter vector $\eta_j$. In most cases
we suggest using a diagonal matrix for $V_\lambda$, the prior covariance matrix of the mean item parameter
vectors; in the absence of prior information, the diagonal elements of the matrix should be large,
e.g., $V_\lambda = 100 I_K$, where $I_K$ is the $K \times K$ identity matrix.

One situation where it is sensible to use an informative prior is when the item family is a
multiple-choice family. In that situation, a good choice for a prior distribution would be one that
places its mass around the point $\mu_g = \text{logit}\{1 / (\text{number of choices})\}$. For example, in the case
where the item family has five choices we suggest using the prior mean $\mu_g = \text{logit}\{0.2\} = -1.386$
and $\sigma_g = 0.1$. Figure 1 contains the density function of the transformed random variable
$Y = 1 / (1 + \exp(-\lambda_g))$, where $\lambda_g \sim \mathcal{N}(-1.386, 0.1^2)$. Note that the density is centered around
$0.2 = 1 / (\text{number of choices})$, with almost all of its mass in the interval (0.15, 0.25).
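
The amount of prior mass in (0.15, 0.25) can be checked without simulation, because the inverse-logit transformation is monotone. The sketch below uses only the Python standard library; the helper names `normal_cdf` and `logit` are illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a N(mean, sd^2) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def logit(p):
    return math.log(p / (1.0 - p))

n_choices = 5
prior_mean = logit(1.0 / n_choices)     # logit(0.2), about -1.386
prior_sd = 0.1

# Y = 1 / (1 + exp(-lambda_g)) is increasing in lambda_g, so
# P(0.15 < Y < 0.25) = P(logit(0.15) < lambda_g < logit(0.25)).
mass = (normal_cdf(logit(0.25), prior_mean, prior_sd)
        - normal_cdf(logit(0.15), prior_mean, prior_sd))
print(round(prior_mean, 3), round(mass, 4))
```

The computed mass is above 0.99, in line with the statement that almost all of the prior mass lies in that interval.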
Figure 1: Probability density function for the random variable $Y = 1 / (1 + \exp(-\lambda_g))$, where
$\lambda_g \sim \mathcal{N}(-1.386, 0.1^2)$.



Family Response Function and Family Score Function 

Note that by integrating the individual item parameter vectors $\eta_j$ out of the item
response functions $P_j(\theta)$, the RSM defines a new set of family response functions (FRF). Let
$P_k(\theta \mid \mathcal{I})$ denote family $\mathcal{I}$’s response function for category k, $k = 0, 1, \ldots, K_{\mathcal{I}} - 1$. The family
response function for category k is defined by

$$
P_k(\theta \mid \mathcal{I}) = \int_{\eta} P_k(\theta \mid \eta)\, d\Phi(\eta \mid \lambda_{\mathcal{I}}, T_{\mathcal{I}}), \quad k = 0, 1, \ldots, K_{\mathcal{I}} - 1, \tag{8}
$$

where $\Phi(\cdot \mid \cdot, \cdot)$ is the cumulative distribution function of the multivariate normal distribution and
$P_k(\theta \mid \eta)$ is the k-th category response function, defined by (1) or (2) depending on whether
the item is an MC item or CR item. The family response function $P_k(\theta \mid \mathcal{I})$ defines the probability
that an individual with proficiency $\theta$ will score k on a randomly selected item from family $\mathcal{I}$.
Sinharay, Johnson, and Williamson (2003), in the context of multiple choice items, call the
posterior expectation of the FRF the family expected response function (FERF).

Notice that an item family containing items with K score categories has the same number
of FRFs. It is often desirable to examine a single function for each item family. Therefore, we
define the family score function (FSF) $m(\theta \mid \mathcal{I})$, which describes the expected score on a randomly
selected item from the $\mathcal{I}$-th item family for an examinee with proficiency $\theta$, as

$$
m(\theta \mid \mathcal{I}) = \sum_{\ell=0}^{K_{\mathcal{I}}-1} \ell \times P_\ell(\theta \mid \mathcal{I}), \tag{9}
$$

where $P_\ell(\theta \mid \mathcal{I})$ is defined in (8). A similar quantity exists for each individual item in each of the
families. The item score function (ISF) for item j, defined as

$$
m_j(\theta) = \sum_{\ell=0}^{K_{\mathcal{I}(j)}-1} \ell \times P_{j\ell}(\theta), \tag{10}
$$

describes the expected score on item j of an individual with proficiency $\theta$, where the $P_{j\ell}(\theta)$’s are
defined in (1) or (2) depending on whether the item is an MC item or CR item.

Note that for a dichotomous item, the FSF becomes $P_1(\theta \mid \mathcal{I})$ and the ISF becomes $P_{j1}(\theta)$.
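
Both (9) and (10) are the same weighted sum applied to a vector of category probabilities. A small Python sketch follows; the function name `score_function` and the example probabilities are illustrative values, not taken from the study.

```python
def score_function(category_probs):
    """Expected score: the sum over categories l of l * P_l, as in (9) and (10)."""
    return sum(score * p for score, p in enumerate(category_probs))

# Three-category item: 0 * 0.2 + 1 * 0.5 + 2 * 0.3
three_category = score_function([0.2, 0.5, 0.3])

# Dichotomous item: the expected score is simply P_1
dichotomous = score_function([0.35, 0.65])
```

Whether the probabilities passed in are FRFs or the response functions of a single item determines whether the result is an FSF or an ISF value at the given proficiency.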

4. Bayesian Estimation for the Related Siblings Model 

Maximum likelihood estimation of the related siblings model requires the calculation of
the joint likelihood function of the family parameters $\lambda_{\mathcal{I}(j)}$’s and $T_{\mathcal{I}(j)}$’s given the observed data.
Consistent estimation of these parameters would require marginalizing the likelihood with respect
to both the examinee parameters $\theta$ and the individual item parameters $\eta$. The calculation of the
family response function (FRF) in (8) demonstrated how the item parameters could be integrated
out of the response function. Suppose $\lambda$ denotes the collection of all $\lambda_{\mathcal{I}(j)}$’s and T denotes the
collection of all $T_{\mathcal{I}(j)}$’s. Now define the conditional likelihood of an examinee with proficiency $\theta$
given the family parameters $\lambda_{\mathcal{I}(j)}$’s and $T_{\mathcal{I}(j)}$’s by taking the product over all item families,

$$
L(\theta \mid x, \lambda, T) = \prod_{\mathcal{I}} P_{x_{\mathcal{I}}}(\theta \mid \mathcal{I}),
$$

where $x_{\mathcal{I}}$ is the examinee’s score on an item from family $\mathcal{I}$, x is the vector of the scores of the
examinee on all items, and $P_{x_{\mathcal{I}}}(\theta \mid \mathcal{I})$ is defined in (8).

It is not enough to simply integrate $\theta$ out of the examinee likelihood and take the
product of the resulting terms to define the likelihood for the item family parameters. Doing so
would require that item responses from different individuals to the same member of an item
family are independent, when in fact they should be considered related to one another. However,
by integrating the individual item parameters $\eta_j$ out of the true joint likelihood, the
resulting model correctly accounts for the fact that responses of two individuals to the same item
from an item family are correlated even when conditioning on the family parameters $\lambda_{\mathcal{I}(j)}$
and $T_{\mathcal{I}(j)}$. Maximizing the correct likelihood for the related siblings model would be an extremely
difficult task requiring complex numerical integration techniques.
We prefer to perform a Bayesian estimation of the model. Bayesian estimation requires
the determination of the joint posterior distribution of all model parameters given the observed
data. Because the posterior distribution requires the evaluation of an intractable integral,
we employ Markov chain Monte Carlo (MCMC; Gilks et al., 1996) techniques, specifically the
Metropolis-Hastings algorithm (MH; Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953;
Hastings, 1970) within a Gibbs sampler (Geman and Geman, 1984). The algorithm generates
a sample of parameter values from a Markov chain that approximates the joint posterior
distribution of the parameters of the model by drawing iteratively from the conditional posterior
distribution of each model parameter.

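Recall the definition of $\eta_j$ in (3). Item parameters $\alpha$, $\beta$, $\gamma$, $\delta$ and the ability variables $\theta$
are drawn from their respective conditional distributions as described in Patz and Junker (1999).
Conditional on the item parameters $\eta$ (the collection of the $\eta_j$’s),
the $\mathcal{I}$-th item family mean vector $\lambda_{\mathcal{I}}$ and covariance matrix $T_{\mathcal{I}}$ are independent of $\theta$ and the
observed data X. The conditional distributions of the $\lambda_{\mathcal{I}}$’s, which are independent over the
families (i.e., over $\mathcal{I}$), are given by

$$
\lambda_{\mathcal{I}} \mid \eta, T_{\mathcal{I}} \sim \mathcal{N}\left( V_{\mathcal{I}} \left( V_\lambda^{-1} \mu_\lambda + J_{\mathcal{I}} T_{\mathcal{I}}^{-1} \bar{\eta}_{\mathcal{I}} \right),\; V_{\mathcal{I}} \right), \tag{11}
$$

where

$$
V_{\mathcal{I}} = \left( J_{\mathcal{I}} T_{\mathcal{I}}^{-1} + V_\lambda^{-1} \right)^{-1},
$$

$\mu_\lambda$ and $V_\lambda$ are the prior mean and covariance matrix of the $\lambda_{\mathcal{I}}$’s, respectively,

$$
\bar{\eta}_{\mathcal{I}} = \frac{1}{J_{\mathcal{I}}} \sum_{j:\, \mathcal{I}(j) = \mathcal{I}} \eta_j,
$$

and $J_{\mathcal{I}}$ is the number of members in item family $\mathcal{I}$.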

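The conditional distributions of the $T_{\mathcal{I}}$’s, which are also independent over the families
(i.e., over $\mathcal{I}$), are given by

$$
T_{\mathcal{I}} \mid \eta, \lambda \sim \text{Inv-Wishart}\left( J_{\mathcal{I}} + W_1,\; \left[ \sum_{j:\, \mathcal{I}(j) = \mathcal{I}} (\eta_j - \lambda_{\mathcal{I}(j)})(\eta_j - \lambda_{\mathcal{I}(j)})' + W_2^{-1} \right]^{-1} \right); \tag{12}
$$

that is, in the notation of (7), $T_{\mathcal{I}}^{-1}$ given $\eta$ and $\lambda$ follows a Wishart distribution with these two parameters.
Hence the addition of the hierarchical component in the model amounts to additional sampling
from normal and inverse-Wishart distributions, both of which are straightforward, and the
hierarchy of the model does not pose significant difficulties in the Bayesian estimation procedure.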
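
The two conditional draws in (11) and (12) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' C++ implementation: the function names are hypothetical, the item parameters are synthetic, and the Wishart draw assumes that $J_{\mathcal{I}} + W_1$ is an integer so that it can be formed as a sum of outer products of normal vectors.

```python
import numpy as np

def family_mean_conditional(eta, T, mu_lam, V_lam):
    """Mean and covariance of lambda_I | eta, T_I as in (11).

    eta is a (J, k) array holding the item parameter vectors of one family.
    """
    J = eta.shape[0]
    T_inv = np.linalg.inv(T)
    V_lam_inv = np.linalg.inv(V_lam)
    V_I = np.linalg.inv(J * T_inv + V_lam_inv)
    mean = V_I @ (V_lam_inv @ mu_lam + J * (T_inv @ eta.mean(axis=0)))
    return mean, V_I

def draw_family_cov(eta, lam, W1, W2, rng):
    """Draw T_I from (12) by drawing the precision T_I^{-1} from a Wishart.

    Assumption: J + W1 is an integer, so the Wishart draw can be built as a
    sum of J + W1 outer products of N(0, S) vectors.
    """
    J, k = eta.shape
    resid = eta - lam
    S = np.linalg.inv(resid.T @ resid + np.linalg.inv(W2))
    S = (S + S.T) / 2.0                       # remove round-off asymmetry
    z = rng.multivariate_normal(np.zeros(k), S, size=J + W1)
    precision = z.T @ z
    return np.linalg.inv(precision)

rng = np.random.default_rng(0)
eta = rng.normal(size=(10, 3))                # synthetic parameters for one family
mean, V_I = family_mean_conditional(eta, 0.04 * np.eye(3), np.zeros(3), 100.0 * np.eye(3))
lam = rng.multivariate_normal(mean, V_I)      # one draw from (11)
T_new = draw_family_cov(eta, lam, 4, np.eye(3), rng)   # one draw from (12)
```

With a diffuse prior ($V_\lambda = 100 I$) the conditional mean is essentially the average of the family's item parameter vectors, which shows how little the prior contributes in that setting.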
Estimating the family response function. We use Monte Carlo integration to estimate
the FRF defined in (8) and to estimate a 95% prediction interval for the family. The following
steps describe the Monte Carlo procedure used for the estimation of the FRF and 95% prediction
interval for the $\mathcal{I}$-th family of items.

i) Generate a sample of size M from the joint posterior distribution of the hyper-parameters
$\lambda_{\mathcal{I}}$ and $T_{\mathcal{I}}$. That is, draw

$$
[\lambda_{\mathcal{I}}^{(t)}, T_{\mathcal{I}}^{(t)}] \sim F(\lambda_{\mathcal{I}}, T_{\mathcal{I}} \mid X), \quad t = 1, \ldots, M.
$$

ii) For each of the above M values of the hyper-parameters $[\lambda_{\mathcal{I}}^{(t)}, T_{\mathcal{I}}^{(t)}]$, draw n values of the item
parameter vector $\eta$ from the conditional (prior) distribution of $\eta$ given $\lambda_{\mathcal{I}}^{(t)}$ and $T_{\mathcal{I}}^{(t)}$,

$$
\eta^{(t,r)} \sim \Phi(\eta \mid \lambda_{\mathcal{I}}^{(t)}, T_{\mathcal{I}}^{(t)}),
$$

for $r = 1, \ldots, n$ and $t = 1, \ldots, M$.

iii) Set the ability at $\theta$.

iv) For each of the Mn draws obtained in step ii), compute the probability for category $\ell$,
$p_\ell^{(t,r)} = P_\ell(\theta \mid \eta^{(t,r)})$, for each of the item categories $\ell = 0, \ldots, K_{\mathcal{I}} - 1$, where $K_{\mathcal{I}}$ is the number
of score categories for item family $\mathcal{I}$. In addition, calculate the expected score function
$m^{(t,r)} = \sum_{\ell} \ell \times p_\ell^{(t,r)}$.

v) The averages of the above probabilities and expected score functions are Monte Carlo
estimates of the posterior means of the category FRFs and the FSF,

$$
E[P_\ell(\theta \mid \mathcal{I}) \mid X] \approx \frac{1}{Mn} \sum_{t=1}^{M} \sum_{r=1}^{n} p_\ell^{(t,r)},
\qquad
E[m(\theta \mid \mathcal{I}) \mid X] \approx \frac{1}{Mn} \sum_{t=1}^{M} \sum_{r=1}^{n} m^{(t,r)}.
$$

Sinharay, Johnson, and Williamson (2003) call the above estimated posterior means of FRFs
estimated FERFs while dealing with item families consisting of dichotomous items.

vi) The 2.5th and 97.5th percentiles of the Mn probabilities and expected score functions
form an approximate 95% prediction interval to attach to the estimates obtained in step
v). This prediction interval reflects the within-family variance as well as the uncertainty in
the FRFs.

Steps iii) to vi) are repeated for a number of values of $\theta$ to obtain the estimated FRF for each
category $\ell$ in item family $\mathcal{I}$. We use M = 1000, n = 10, and 100 equidistant values of $\theta$ in the
interval (-4, 4) to estimate the function. Changing M and n does not result in any noticeable
difference.

Note that the MCMC algorithm described earlier in this section provides a sample from
an approximation of the posterior distribution. To obtain a sample of size M from the posterior
distribution of $\lambda_{\mathcal{I}}$ and $T_{\mathcal{I}}$ in step i) above, draw a sub-sample of size M from the output of the
MCMC algorithm. Step ii) is simple because it only requires sampling from a multivariate normal
distribution using the draws of $\lambda_{\mathcal{I}}$ and $T_{\mathcal{I}}$ from step i). So the estimation of the FRF is quite
straightforward given the output from the MCMC algorithm and takes little additional time.
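
Steps i) to vi) above can be sketched as follows for a constructed response (GPCM) family. This is a hedged illustration with NumPy, not the authors' code: the function names are hypothetical, the posterior draws of $\lambda_{\mathcal{I}}$ and $T_{\mathcal{I}}$ are assumed to be supplied as lists sub-sampled from MCMC output, and the sketch assumes that $\delta_{j1}$ is the step parameter determined by the identifiability constraints, so that $\eta$ has exactly as many entries as there are score categories.

```python
import numpy as np

def gpcm_probs(theta, eta):
    """GPCM category probabilities for eta = (alpha, beta, delta_2, ..., delta_{K-1})."""
    a = np.exp(eta[0])                        # alpha = log(a)
    beta = eta[1]
    free = eta[2:]                            # the K - 2 free step parameters
    # delta_0 = 0 and sum(delta) = 0 determine delta_1 (assumed convention).
    delta = np.concatenate(([0.0, -free.sum()], free))
    k = np.arange(delta.size)
    logits = k * a * (theta - beta) - a * np.cumsum(delta)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def estimate_frf_fsf(lam_draws, T_draws, theta_grid, n=10, rng=None):
    """Monte Carlo estimates of the FRFs, the FSF, and a 95% band for the FSF."""
    rng = np.random.default_rng() if rng is None else rng
    # Step ii: n item parameter vectors for each posterior draw of (lambda, T).
    etas = np.vstack([rng.multivariate_normal(lam, T, size=n)
                      for lam, T in zip(lam_draws, T_draws)])
    K = etas.shape[1]                         # for the GPCM, len(eta) equals K
    scores = np.arange(K)
    frf = np.empty((len(theta_grid), K))
    fsf = np.empty(len(theta_grid))
    band = np.empty((len(theta_grid), 2))
    for g, theta in enumerate(theta_grid):    # steps iii to vi for each theta
        probs = np.array([gpcm_probs(theta, eta) for eta in etas])
        expected = probs @ scores             # expected score for each draw
        frf[g] = probs.mean(axis=0)           # step v: posterior mean FRFs
        fsf[g] = expected.mean()              # step v: posterior mean FSF
        band[g] = np.percentile(expected, [2.5, 97.5])   # step vi
    return frf, fsf, band
```

A call such as `estimate_frf_fsf(lam_draws, T_draws, np.linspace(-4, 4, 100), n=10)` mirrors the settings reported above. The plain Python loop over draws is kept for readability and would be vectorized in a production setting.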

5. Example 1: A Simulation Study 

In this section, a simulation study examines the performance of the RSM on data 
generated according to a family structure. 

Generating Data with Family Structure 

The data generation process is very similar to that in Sinharay, Johnson, and Williamson
(2003). We generate 16 item families: five families of two-category constructed response items
(2PL), five families of multiple choice items (3PL), and six families of polytomous constructed
response items (GPCM), of which two have three-category items, two have four-category items, and two
have five-category items. Each item family contains ten items (siblings). We generate the values
of the individual proficiency parameters $\theta_i$ for N = 5000 examinees from a normal distribution
with mean 0 and variance 1; each examinee receives one of ten “forms” of the test. Each form
contains 16 items, one item from each of the 16 item families. The first items of the families make
up the first form, the second items of the families make up the second form, et cetera. Exactly
500 randomly selected examinees respond to each form. Because the items within one family are
independent of the items in all other families, this design is not biased in any way.

This study generates item families in such a way that the resulting item parameters reflect
the range of values typically observed in real-life assessments. For example, in our experience,
the discrimination parameters $a_j$ usually fall between 0.5 and 1.5, and so the data generator
draws item families so that the item parameters will remain in that range. The data generator
draws individual item parameters so that the within-family variance is one-fifth as large as the
between-family variance. Recall that an RSM with very small within-family variance is
essentially an ISM and an RSM with large within-family variance is essentially a USM (Sinharay,
Johnson, & Williamson, 2003); the ratio of within-family variance to between-family variance
used here makes the data distinguishable from data generated by a USM or an ISM.



Analysis of the Simulated Data 

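To analyze the data we use a Wishart prior distribution for the family precision matrix
$T_{\mathcal{I}}^{-1}$ with parameters $W_1 = k_{\mathcal{I}} + 1$ and $W_2$ proportional to $I_{k_{\mathcal{I}}}$, where $k_{\mathcal{I}}$ is the number of item parameters in
the model for an item from family $\mathcal{I}$ (e.g., $k_{\mathcal{I}} = 3$ for a three-category item) and $I_{k_{\mathcal{I}}}$ is the $k_{\mathcal{I}} \times k_{\mathcal{I}}$
identity matrix. The prior mean of $T_{\mathcal{I}}^{-1}$ implied by this choice, $W_1 W_2$, is then also proportional to $I_{k_{\mathcal{I}}}$.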

We apply the MCMC algorithm to approximate the posterior distributions of the 
model parameters. For this example, a number of convergence diagnostics (time-series plots, 
Gelman-Rubin convergence diagnostics, and Brooks-Gelman multivariate potential scale reduction
factor) indicate that a chain with 50,000 iterations is sufficient to ensure convergence. We discard 
the first 10,000 iterations as burn-in and use every 10th draw from the Markov chain, leaving us 
with 4,000 draws from the approximated posterior distribution of the model parameters. 
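
The count of retained draws follows directly from the burn-in and thinning settings. A minimal Python sketch, in which the list of integers is only a stand-in for stored MCMC iterations and the choice of which draw in each block of ten is kept is an assumption:

```python
def thin_chain(chain, burn_in, thin):
    """Drop the first burn_in draws, then keep every thin-th remaining draw."""
    return chain[burn_in::thin]

chain = list(range(50_000))           # stand-in for 50,000 stored iterations
kept = thin_chain(chain, burn_in=10_000, thin=10)
print(len(kept))                      # 4000
```

That is, (50,000 - 10,000) / 10 = 4,000 draws remain for posterior summaries.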

Figures 2 and 3 plot the approximated marginal posterior densities for the family 
parameters $\lambda_a$ and $\lambda_b$, respectively. Figure 4 contains the approximated posterior density for the
family guessing parameter for the five 3PL item families. The vertical line in each panel of the 
figures represents the average value of the simulating item parameters in that family. 

Figure 3 indicates that the MCMC estimation algorithm does an excellent job of
recovering the simulating values of the $\lambda_b$’s, the family difficulty parameters. In each of the 16 item
families the simulating value is contained within the 95% credible interval for that parameter.

Fourteen of the 95% credible intervals for the family discrimination parameters $\lambda_a$ also
contain the simulating parameter values. Families 8 and 10 are the only two families whose
credible intervals do not contain the true value. The discrimination parameter of the eighth family
is underestimated, and $\lambda_a$ for the tenth family is overestimated. Further inspection reveals that
these two families are multiple choice item families. As is the case with simple item response
models, multiple choice item families do not behave as well as constructed response item families.
Figure 2: Estimated posterior density functions of the family mean discriminations
$\lambda_a$ for the simulation study.
Notice in Figure 4 that the family guessing parameter for the eighth family is
underestimated and the guessing parameter for the tenth family is overestimated. Because of 
the near indeterminacy (especially for difficult items) of the 3PL model parameters, the overall 
effect of the under or overestimation of the family parameters for Families 8 and 10 is minimal. 
This is evident in Figure 5, which contains the family score functions (FSFs) for the 16 simulated 
families, along with the simulating item score functions (ISF) and the 95% prediction intervals for 
the families. 

The FSFs and 95% prediction intervals for families eight and ten track the simulating
ISFs quite well; only small portions of a single ISF extend beyond the 95% intervals in each of
these two families. The two families that cause the greatest concern are the seventh and fourteenth
item families.

Figure 3: Estimated posterior density functions of the family mean difficulties $\lambda_b$ for
the simulation study.

The FSF for Family 7 clearly has an asymptote that is different from the simulating ISFs. 
Once again this can be explained by the near indeterminacy between the difficulty and guessing 
parameters in the 3PL when the item (or family) is an easy one and noting that Family 7 consists 
of very easy items. The model is unable to distinguish between an item that is easy and one that 
is easy to guess. 

The model performs poorly for Family 14. There is a single item whose ISF is almost
completely outside of the 95% prediction interval for the family. It is quite clear that the
within-family variance for this family is underestimated. Figure 6 contains the estimated posterior
densities of the within-family variance of the difficulty parameters (denoted as $\tau_b^2$) along with the
simulating variance.

Figure 4: Estimated posterior density functions of the family mean asymptotes $\lambda_g$
for the simulation study.

The simulating variance for the fourteenth family is the largest variance across all item 
families, but the approximated posterior density for this family is not substantially different 
from the other fifteen families. This might be an indication that the amount of information in 
the observed data about this variance component is small relative to the amount of information 
provided by the prior distribution. This is not surprising given that the simulated data has only 
ten items per item family, which is probably too few for estimating the within family variances. 

Figure 5: The estimated FSFs (solid bold lines), corresponding 95% prediction intervals
(solid lines), and the simulating ISFs (dashed lines) for the 16 item families in the
simulated data set.

Figure 6: Estimated posterior density functions of the within-family variance component
of the difficulties, for the simulation study.
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In addition to estimating family parameters, the MCMC algorithm also provides us with 
sampled values from the approximate posterior distribution of each examinee ability $\theta_i$. Figure 7
contains the approximated posterior densities of five individual $\theta_i$'s: an individual with the lowest
raw score (0), an individual with the 25th percentile raw score (10), an individual with the median
raw score (21), and an individual with the highest raw score (28). The true values of the $\theta_i$'s are also
shown using vertical lines. The 95% posterior credible interval contains the true simulating value
of $\theta$ in all five cases. Although the 95% credible interval $(-1.76, -0.44)$ for the individual
with a raw score of ten barely contains the simulating value of $\theta = -0.46$, the posterior probability
that this individual's $\theta$ is greater than $-0.46$ is only $\Pr\{\theta > -0.46 \mid X_i\} = 0.03$.
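
Summaries of this kind are computed directly from the sampled values. The sketch below shows one way to obtain an equal-tailed 95% credible interval and a tail probability from a set of draws. The draws are simulated stand-ins from a normal distribution, not the MCMC output of this study, and the function names are hypothetical.

```python
import random
import statistics

def credible_interval_95(draws):
    """Equal-tailed 95% interval: the 2.5% and 97.5% sample quantiles."""
    cuts = statistics.quantiles(draws, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]

def prob_greater(draws, cutoff):
    """Monte Carlo estimate of Pr(theta > cutoff | data)."""
    return sum(1 for d in draws if d > cutoff) / len(draws)

rng = random.Random(7)
# Stand-in draws (an assumption): normal with mean -1.10 and sd 0.337.
theta_draws = [rng.gauss(-1.10, 0.337) for _ in range(10000)]

lower, upper = credible_interval_95(theta_draws)
tail = prob_greater(theta_draws, -0.46)

print(f"95% credible interval: ({lower:.2f}, {upper:.2f})")
print(f"Pr(theta > -0.46 | data) is approximately {tail:.3f}")
```

The same two functions apply unchanged to the retained MCMC draws of any $\theta_i$.
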
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Figure 7: Markov chain Monte Carlo approximated posterior densities for five of the 
simulated examinees. 
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6. Example 2: Analysis of the NAEP MathOnline Data 

The National Assessment of Educational Progress (NAEP) is an ongoing educational 
survey administered by the National Center for Education Statistics (NCES). NAEP regularly 
reports on the progress of students in fourth, eighth, and twelfth grade on a number of educational 
subjects (e.g. mathematics, reading). 

The Technology-Based Assessment (TBA) project is a NAEP special study sponsored by
NCES. The project is designed to explore the use of technology, especially computers, as a tool to
improve the quality and efficiency of assessments (NCES, 2002). One of the studies included in
the TBA project is the Mathematics Online (MOL) special study (Sandene, Bennett, Braswell,
& Oranje, in press). The MOL study translates existing NAEP math questions into a computer
delivery system to be used for the assessment of fourth and eighth grade students. The main goals
of the MOL study were to: (a) determine how computer delivery affects the assessment of the
examinees, (b) evaluate the abilities of fourth and eighth grade students to use a computer-based
assessment, and (c) investigate the ability to create alternate versions of the assessment with the use
of automatic item generation (AIG) (at grade 8 only). In the following pages we focus on the
eighth grade MOL study.



Data 

This paper looks at the responses of 3793 examinees in grade 8, distributed among four test
forms (denoted M2-M5). Each test form has a block of common items (denoted MP) and an
additional 26 items that vary, either in content or delivery method, across the four forms. Of the
26 items that vary across the forms, sixteen are multiple-choice (MC) items and ten are constructed
response (CR) items. Of the ten constructed response items, five items have two response categories
each, three have three response categories each, and two have four response categories each.

The items on form M2 are the “parent” items. These are all written by humans and are
representative of the NAEP mathematics item pool. Form M2 is a paper & pencil assessment
much like the standard NAEP assessments (only shorter), with calculators provided for the
students.

The content of form M3 is identical to form M2. However, form M3 is a computer-based 
assessment form and students must use an online calculator. 

Forms M4 and M5 contain eleven items (six CR and five MC) that are identical to
items on forms M2 and M3. The remaining fifteen items on forms M4 and M5 are automatically
generated (Singley & Bennett, 2002) from an item model based on the items on form M2. However,
these fifteen items are not the same on forms M4 and M5. So when compared to one another,
forms M4 and M5 have eleven identical items and fifteen items that vary across the two forms.
Forms M4 and M5 are paper & pencil forms with calculators provided where necessary, like form
M2.

Table 1 summarizes the 26 item families; it provides the reader with the item types 
(e.g. 3PL, 3 category etc.) in each family and whether or not the items in M4 and M5 were 
automatically generated. 

Sinharay, Johnson, and Williamson (2003) analyzed the multiple choice items from this 
data set with Glas and van der Linden’s (1999) RSM. The latter part of this section analyzes this
data set using the RSM introduced in Section 3 to demonstrate the practicality of the model. 
There are twenty-six item families, one corresponding to each item on form M2. Although some 
of the items are identical across the forms, we treat the items on each form as distinct items. 

Table 1: The item types and indicators of whether or not the items are automatically
generated for the item families for the MOL data set

Family   Type         AIG items (Y/N)
1        3PL          N
2        2PL          Y
3        3 Category   Y
4        3PL          N
5        3PL          Y
6        3PL          Y
7        2PL          Y
8        3PL          N
9        3PL          Y
10       4 Category   N
11       3PL          Y
12       3PL          Y
13       3 Category   N
14       3PL          Y
15       2PL          N
16       2PL          N
17       2PL          N
18       3PL          N
19       3 Category   Y
20       3PL          Y
21       3PL          Y
22       3PL          N
23       3PL          Y
24       3PL          Y
25       3PL          Y
26       4 Category   N

The eleven item families for items that do not vary across the forms give us some idea about the
amount of tolerable variation (Rizavi, Way, Davey, & Herbert, 2002), as the variation is simply
a combination of sampling and administration variation. The analysis ignores two pieces of
information that should be utilized in a full analysis of this data. The first is the common block
of items in each form (MP). The second is whether the examinee completed the assessment online
or with paper & pencil.



Analysis 

We use the same Wishart prior distribution as we did for the simulated data set.
We approximate the posterior distribution of the model parameters using 100,000 iterations from
an MCMC algorithm; the first 10,000 iterations are discarded as burn-in, and the remaining 90,000
iterations are thinned by selecting every 9th iteration for inclusion in the final sample from
the approximated posterior distribution. Convergence diagnostics are applied, as in the simulation
study, to make sure that the MCMC algorithm converges.
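
The burn-in and thinning scheme can be written out as follows. This is a minimal sketch in which the chain is replaced by a list of iteration numbers; which iteration within each block of nine is retained is not stated above, so keeping the last one is an assumption.

```python
n_iter, burn_in, thin = 100_000, 10_000, 9

# Stand-in for the stored chain: element k holds iteration number k + 1.
chain = list(range(1, n_iter + 1))

# Drop the burn-in, then keep every 9th remaining iteration
# (the 9th, 18th, ... iteration after burn-in; an assumed convention).
kept = chain[burn_in:][thin - 1::thin]

print(len(kept), kept[0], kept[-1])
```

The retained sample has 90,000 / 9 = 10,000 draws, running from iteration 10,009 to iteration 100,000.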

Figures 8 and 9 contain the estimated family score functions (along with the 95% 
prediction intervals) and the item score functions for each family; the former shows the families 
without any automatically generated items and the latter shows the families with automatically 
generated items. 
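
For readers who want to see how a curve of this kind can be produced, the following is a rough Monte Carlo sketch for a single dichotomous 3PL family. It is not the estimation procedure of this paper. It holds the family mean and within-family standard deviations fixed at hypothetical values instead of averaging over their posterior distribution, and it assumes that $(\log a, b, \mathrm{logit}\, c)$ are independent normal variables within a family. All function names are illustrative.

```python
import math
import random
import statistics

def isf_3pl(theta, a, b, c):
    """Item score function of a dichotomous 3PL item: expected score at theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def draw_siblings(n, mean, sd, rng):
    """Draw n sibling items; (log a, b, logit c) are independent normals (assumed)."""
    siblings = []
    for _ in range(n):
        log_a = rng.gauss(mean[0], sd[0])
        b = rng.gauss(mean[1], sd[1])
        logit_c = rng.gauss(mean[2], sd[2])
        siblings.append((math.exp(log_a), b, 1.0 / (1.0 + math.exp(-logit_c))))
    return siblings

def fsf_with_band(thetas, siblings):
    """Pointwise mean of the sibling ISFs and a 95% band from sample quantiles."""
    rows = []
    for theta in thetas:
        scores = [isf_3pl(theta, a, b, c) for a, b, c in siblings]
        cuts = statistics.quantiles(scores, n=40)  # cut points every 2.5%
        rows.append((theta, cuts[0], statistics.fmean(scores), cuts[-1]))
    return rows

rng = random.Random(2003)
# Hypothetical family mean and within-family standard deviations.
family_mean = (0.0, 0.0, math.log(0.2 / 0.8))
family_sd = (0.2, 0.5, 0.3)

siblings = draw_siblings(2000, family_mean, family_sd, rng)
thetas = [-3.0 + 0.5 * k for k in range(13)]  # grid on [-3, 3]
curve = fsf_with_band(thetas, siblings)

for theta, lo, fsf, hi in curve:
    print(f"theta={theta:5.1f}  FSF={fsf:.3f}  95% band=({lo:.3f}, {hi:.3f})")
```

Using the same simulated siblings at every ability value keeps the averaged curve monotone in $\theta$. A full Bayesian version would repeat the calculation over posterior draws of the family parameters.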

The item families without AIG items generally have a set of ISFs that are closer to the
FSFs than families with AIG items. This is, of course, not surprising considering that a family
without AIG items contains the same item appearing in different forms. Despite the generally
close ISFs for item families without AIG items, there is some variation evident among the ISFs for
these families. The greatest observed variation occurs in Families 13 and 26, where the item on
form M3 behaves differently from the other three items, and in Family 10, where the ISFs appear
to be quite different at the high end of the scale.

Examination of the families that contain AIG items reveals a couple of clearly visible
deviations. Most obvious is the fact that every ISF in Family 9 is flat, suggesting that
students have the same random chance of correctly answering that question, regardless of their
ability level. Since this is true for both the human-generated item (appearing on forms M2 and
M3) and the AIG items (appearing on forms M4 and M5), it appears that this is the result of a
characteristic of the item type or content rather than the result of anything inherent in automatic
item generation. In fact, in the operational analysis of this data set, this item has been dropped
from the analysis (Sandene, Bennett, Braswell, & Oranje, in press).

Family 5, also an AIG family, contains one ISF that is quite different from the other
three. In this family, the manually generated items in M2 and M3 and the AIG item appearing in
M5 all have very similar ISFs, while the AIG item from form M4 deviates dramatically from the
other three ISFs in the family. The extent of the deviation appears to affect the response
function for the family as a whole.

Figure 10 contains the estimated posterior densities for five examinees from the Math
Online study. Four of these examinees have varying raw scores (from very low to very high), and the
estimated posterior distributions reflect that. The fifth examinee did not respond to any item,
and hence the posterior we observe is simply the $\mathcal{N}(0, 1)$ prior distribution.
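
The last point follows because an examinee with no responses contributes a likelihood that is constant in $\theta$. The sketch below checks this on a grid. It is only an illustration: it uses a 2PL likelihood with hypothetical item parameters, not the fitted model, and the function names are assumptions.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grid_posterior(responses, grid):
    """Normalized posterior weights on a theta grid under a N(0, 1) prior.

    responses is a list of (score, a, b) tuples for the answered items;
    an empty list means the examinee answered nothing.
    """
    weights = []
    for theta in grid:
        like = 1.0  # empty product when there are no responses
        for score, a, b in responses:
            p = p_2pl(theta, a, b)
            like *= p if score == 1 else 1.0 - p
        weights.append(normal_pdf(theta) * like)
    total = sum(weights)
    return [w / total for w in weights]

grid = [-4.0 + 0.05 * k for k in range(161)]
prior_total = sum(normal_pdf(t) for t in grid)
prior = [normal_pdf(t) / prior_total for t in grid]

no_response = grid_posterior([], grid)
# Hypothetical examinee with two correct answers on items with a = 1, b = 0.
two_correct = grid_posterior([(1, 1.0, 0.0), (1, 1.0, 0.0)], grid)

mean_none = sum(t * w for t, w in zip(grid, no_response))
mean_two = sum(t * w for t, w in zip(grid, two_correct))
print(f"posterior mean, no responses: {mean_none:.3f}")
print(f"posterior mean, two correct:  {mean_two:.3f}")
```

With no responses the posterior weights coincide with the prior weights, so the posterior mean is zero up to rounding error. Any observed response moves the posterior away from the prior.
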
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Figure 8: Estimated FSFs, corresponding 95% prediction intervals (bold solid lines), 
and ISFs (lighter curves: small dashed lines for M2, dotted lines for M3, dots and 
dashes for M4, and long dashes for M5) for the 11 families that have no AIG items. 
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Figure 9: Estimated FSFs, corresponding 95% prediction intervals (bold solid lines), 
and ISFs (lighter curves: small dashed lines for M2, dotted lines for M3, dots and 
dashes for M4, and long dashes for M5) for the 15 families with AIG items. 
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Figure 10: The posterior densities of five examinees from the NAEP Math Online 
special study. 




7. Conclusions and Future Work 

Our work shows that when a test consists of item families, the RSM can take into account 
the dependence among the items belonging to the same item family. The MCMC algorithm for 
Bayesian model fitting allows us to include the additional parameters in the hierarchical model 
without much additional difficulty. This work also suggests a useful way to summarize the results 
for an item family using the FSF. We believe that this work is an important step in creating 
a statistical tool that can be used to analyze tests involving item families. Such tests require 
calibration of an item family only once; the items belonging to the same family may be used in 
future tests without the trouble of calibrating those items. This will be very useful in
automatic item generation systems where items are automatically generated from item models. 
However, a lot of additional research is required prior to such operational applications. 

First among them is to find out the sample size required to achieve a pre-specified
accuracy. It is clear that the proposed model is more complicated than a simple IRT model (USM)
and hence would require a larger sample size than what is required for a USM; it would be helpful
to be able to provide some guidance as to “how large” the sample size should be. Also, given a
specific number of examinees who will take a test involving item families, we would like to determine
the optimum values of the number of siblings per family and the number of examinees per sibling.

We would also like to study if the results of the analysis are sensitive to the prior 
distributions on the model parameters. Our analyses so far indicate that they are, especially the 
prior distributions on the hyperparameters corresponding to the within family variance when each 
item family consists of only a few items. This is especially true with the MOL data set, where there are
only four items in each item family. 

Finally, it will be quite helpful to be able to include covariates in the model, either task 
feature variables or demographic variables. For example, our analysis of MOL data in Section 6 
ignored the facts that some examinees completed the assessment online while some others did it 
with paper and pencil and that some item families have AIG items while some do not. A model 
taking those facts into account might perform better than the one proposed here. 
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